This report uses R and exploratory data analysis techniques to examine a dataset on white wine quality. The dataset comes from the paper “Modeling wine preferences by data mining from physicochemical properties”. The full reference information can be found in the References section at the end of this report.
The dataset contains several physicochemical attributes from samples of white wine of the Portuguese “Vinho Verde” and has sensory classifications made by wine experts.
Taking a look at the data, this is what we find:
## 'data.frame': 4898 obs. of 12 variables:
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
As the listing shows, there are 12 variables and 4898 observations. All of the variables are numeric, with quality stored explicitly as an integer.
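The listing above is the kind of output `str()` prints for a data frame. As a minimal sketch, assuming the data come as a semicolon-separated file with the column names shown in the listing (the separator is an assumption), the first three observations can be read and inspected like this. The values are transcribed from the rounded listing, so they are not the exact stored values.

```r
# First three observations, transcribed from the (rounded) str() listing above.
# In practice read.csv() would point at the downloaded file instead of `text`.
csv_text <- "alcohol;residual.sugar;chlorides;sulphates;free.sulfur.dioxide;total.sulfur.dioxide;citric.acid;volatile.acidity;fixed.acidity;pH;density;quality
8.8;20.7;0.045;0.45;45;170;0.36;0.27;7;3;1.001;6
9.5;1.6;0.049;0.49;14;132;0.34;0.3;6.3;3.3;0.994;6
10.1;6.9;0.05;0.44;30;97;0.4;0.28;8.1;3.26;0.995;6"

wine_head <- read.csv(text = csv_text, sep = ";")

# One line per column: its type and its first few values.
str(wine_head)

# Dimensions of the sample (rows, columns).
dim(wine_head)
```

With the full file, the same `str()` call would report the 4898 observations and 12 variables described above.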
The variables are based on the physicochemical tests, and are as follows along with explanations of what they entail:
Alcohol: the alcoholic percentage content of the wine. Measured as percentage by volume.
Residual sugar: the amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram per liter, and wines with more than 45 grams per liter are considered sweet. Measured in grams per cubic decimeter (g/dm^3).
Chlorides: the amount of salt in the wine. Measured in grams per cubic decimeter.
Sulphates: a wine additive that can contribute to sulfur dioxide gas (SO2) levels; SO2 acts as an antimicrobial and antioxidant. Measured as potassium sulfate in grams per cubic decimeter.
Free sulfur dioxide: the free form of SO2, which exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. Measured in milligrams per cubic decimeter (mg/dm^3).
Total sulfur dioxide: the amount of free and bound forms of SO2. In low concentrations SO2 is mostly undetectable in wine, but when free SO2 concentrations exceed 50 ppm, SO2 becomes evident in the nose and taste of the wine. Measured in milligrams per cubic decimeter.
Citric acid: in small quantities, citric acid can add ‘freshness’ and flavor to wines. Measured in grams per cubic decimeter.
Volatile acidity: the amount of acetic acid in the wine, which at too high a level can lead to an unpleasant vinegar taste. Measured as acetic acid in grams per cubic decimeter.
Fixed acidity: acids that are fixed, or nonvolatile (they do not evaporate readily). Measured as tartaric acid in grams per cubic decimeter.
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content. Measured in grams per cubic centimeter (g/cm^3).
Quality: the score assigned by wine experts, treated as the output variable with all the other variables as inputs. Scores range from 0 (poor wine quality) to 10 (exceptional wine quality).
A summary of the data shows the variability of each variable:
## alcohol residual.sugar chlorides sulphates
## Min. : 8.00 Min. : 0.600 Min. :0.00900 Min. :0.2200
## 1st Qu.: 9.50 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.:0.4100
## Median :10.40 Median : 5.200 Median :0.04300 Median :0.4700
## Mean :10.51 Mean : 6.391 Mean :0.04577 Mean :0.4898
## 3rd Qu.:11.40 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.:0.5500
## Max. :14.20 Max. :65.800 Max. :0.34600 Max. :1.0800
## free.sulfur.dioxide total.sulfur.dioxide citric.acid volatile.acidity
## Min. : 2.00 Min. : 9.0 Min. :0.0000 Min. :0.0800
## 1st Qu.: 23.00 1st Qu.:108.0 1st Qu.:0.2700 1st Qu.:0.2100
## Median : 34.00 Median :134.0 Median :0.3200 Median :0.2600
## Mean : 35.31 Mean :138.4 Mean :0.3342 Mean :0.2782
## 3rd Qu.: 46.00 3rd Qu.:167.0 3rd Qu.:0.3900 3rd Qu.:0.3200
## Max. :289.00 Max. :440.0 Max. :1.6600 Max. :1.1000
## fixed.acidity pH density quality
## Min. : 3.800 Min. :2.720 Min. :0.9871 Min. :3.000
## 1st Qu.: 6.300 1st Qu.:3.090 1st Qu.:0.9917 1st Qu.:5.000
## Median : 6.800 Median :3.180 Median :0.9937 Median :6.000
## Mean : 6.855 Mean :3.188 Mean :0.9940 Mean :5.878
## 3rd Qu.: 7.300 3rd Qu.:3.280 3rd Qu.:0.9961 3rd Qu.:6.000
## Max. :14.200 Max. :3.820 Max. :1.0390 Max. :9.000
A boxplot of each variable provides a visual baseline for its variability:
## Using as id variables
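The message above suggests that the data were reshaped from wide to long format before plotting, although the report does not show the code. As a rough sketch of that idea in base R only, `stack()` turns a data frame into (value, variable) pairs, and each variable can then get its own boxplot panel. The helper names are illustrative, and the sample rows are taken from the rounded structure listing earlier in the report.

```r
# First three observations of four columns, from the rounded str() listing.
wine_head <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1),
  residual.sugar = c(20.7, 1.6, 6.9),
  chlorides      = c(0.045, 0.049, 0.05),
  pH             = c(3, 3.3, 3.26)
)

# Long format: one row per (value, variable) pair.
wine_long <- stack(wine_head)
names(wine_long) <- c("value", "variable")

# One panel per variable, so each boxplot gets its own y-axis scale.
# Plotting is skipped when the script runs non-interactively.
if (interactive()) {
  op <- par(mfrow = c(1, nlevels(wine_long$variable)))
  for (v in levels(wine_long$variable)) {
    boxplot(wine_long$value[wine_long$variable == v], main = v)
  }
  par(op)
}
```

Separate panels matter here because the variables live on very different scales (for example chlorides near 0.05 and total sulfur dioxide above 100), so a shared axis would flatten most boxes.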
Next, each variable is explored individually, starting with alcohol.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The alcohol concentration distribution is slightly right skewed. The highest peak of the distribution is at 9.5 percent alcohol, and the median value is 10.40 percent. The maximum alcohol content among the observations is 14.20 percent by volume.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Taking a closer look:
The distribution of residual sugar has a median value of 5.2 g/dm^3. The distribution is right skewed, with a long right tail and several observations at the far right that may be outliers. A second plot with those observations removed is shown as well for clarity.
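The report does not say which cutoff was used to decide what counts as an outlier. One common convention, shown here only as an assumed stand-in, is Tukey's rule: flag values more than 1.5 interquartile ranges beyond the quartiles. The quartiles below are the residual sugar values from the summary above, and `tukey_fences` is a hypothetical helper name.

```r
# Tukey's rule: values beyond k * IQR from the quartiles are flagged.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles of residual sugar from the summary above (g/dm^3).
fences <- tukey_fences(q1 = 1.7, q3 = 9.9)
print(fences)

# With the full data frame `wine`, the values inside the fences could be kept:
# inside <- wine$residual.sugar >= fences["lower"] &
#           wine$residual.sugar <= fences["upper"]
```

Under this rule the maximum of 65.8 g/dm^3 lies far above the upper fence, which is consistent with treating the far-right observations as outliers.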
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Taking a closer look:
The distribution of chlorides in the wine samples has a median value of 0.043 g/dm^3. There appear to be outliers along the right tail, with a maximum of 0.346 g/dm^3. A second plot with them removed is shown.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
And here is a closer look:
The distribution of sulphates is slightly right skewed. The median value is 0.470 g/dm^3, and the middle half of the wines (the interquartile range) have a concentration between 0.410 and 0.550 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
A closer look at it:
The distribution of free sulfur dioxide is shown, and is right skewed, with a maximum of 289. There appear to be some outliers as there are few observations between 100 and 289. The median value is 34 mg/dm^3 of free sulfur dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
A zoomed in look:
The distribution of total sulfur dioxide is right skewed with a median value of 134 mg/dm^3. There appear to be some outliers, as there are few observations between roughly 260 and 440.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Looking closer at it:
The median citric acid concentration of the wines tested is 0.320 g/dm^3; this acid seems to be found only in small concentrations in wine. There appear to be some outliers above 1 g/dm^3 of citric acid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Looking at it zoomed in:
The median volatile acidity is 0.260 g/dm^3. The middle half of the observations fall in the range 0.210 to 0.320 g/dm^3, and the outliers are at the higher end of the range, roughly above 0.9 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Getting a closer look:
The median fixed acidity for the white wines in the dataset is 6.80 g/dm^3. The middle half of the wines tested have an acidity between 6.30 and 7.30 g/dm^3. The distribution of fixed acidity is slightly right skewed, and there are some outliers in the higher range, roughly above 10.5 g/dm^3. The maximum is 14.20 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
All wines typically have a low pH level. Acids are produced through the fermentation process. The median value is 3.180, and the middle half of the wines have a pH between 3.090 and 3.280.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
A zoomed in view:
The density of the observations varies only a little, with the middle half of the values lying between 0.9917 and 0.9961 g/cm^3. This makes sense, as wine has a density close to that of water. The distribution's median value is 0.9937 g/cm^3.
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
The distribution of wine quality appears to be roughly normal, with most wines at an average quality rating of 5 or 6. There are no wines with a quality rating lower than 3 and none higher than 9.
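To put numbers on how the ratings are spread, the counts from the table above can be turned into shares. This is a small sketch using only those counts; the variable names are illustrative.

```r
# Counts per quality rating, copied from the table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

n_total <- sum(quality_counts)                       # number of samples
shares  <- round(100 * quality_counts / n_total, 1)  # percentage per rating
print(shares)

# Share of wines rated 5 or 6, and the mean rating implied by the counts.
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / n_total
mean_quality <- sum(as.integer(names(quality_counts)) * quality_counts) / n_total
```

Roughly three quarters of the samples are rated 5 or 6, and the mean implied by the counts agrees with the mean quality in the summary. The very small groups at ratings 3 and 9 are worth keeping in mind when reading per-rating medians later on.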
What is the structure of your dataset?
The dataset has 12 variables regarding 4898 observations. Each observation corresponds to a white wine sample of the Portuguese “Vinho Verde”. Of the variables, 11 correspond to the results of a physicochemical test and one variable (quality) corresponds to the result of a sensory panel rating by wine experts.
What is/are the main feature(s) of interest in your dataset?
The main feature of interest in the dataset is the quality rating of each sample.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
The physicochemical test results may help support the investigation into the dataset. All of them are related to characteristics which may affect the flavor profile of the wine. They correspond to concentrations of molecules which may have an overall impact on taste, and by extension, the quality rating of the wine. Density is a physical property which will depend on the percentage of alcohol and sugar content, which will also affect taste of the wine.
Some variables may have a stronger correlation with each other. For instance, the pH will depend on the amount of acid concentration, while total sulfur dioxide may have a similar distribution to that of free sulfur dioxide levels.
Did you create any new variables from existing variables in the dataset?
No new variables were created in the dataset for this analysis.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
There were no unusual distributions. There were also no missing values and no need to adjust the data. It was already a tidy dataset. There were some outliers in the data that were noted, and these might have been due to an input error when recording the data.
## [1] "Median of alcohol by quality:"
## wine$quality: 3
## [1] 10.45
## ------------------------------------------------------------
## wine$quality: 4
## [1] 10.1
## ------------------------------------------------------------
## wine$quality: 5
## [1] 9.5
## ------------------------------------------------------------
## wine$quality: 6
## [1] 10.5
## ------------------------------------------------------------
## wine$quality: 7
## [1] 11.4
## ------------------------------------------------------------
## wine$quality: 8
## [1] 12
## ------------------------------------------------------------
## wine$quality: 9
## [1] 12.5
Apart from a dip in median alcohol content down to the rating of 5, the higher the alcohol content, the higher the rating the wine seems to be given.
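The per-rating medians above are printed in the format that `by()` produces, with one block per group separated by dashed lines. The sketch below reproduces that layout on a small toy data frame; the toy values are made up for illustration and are not taken from the wine dataset.

```r
# Toy data for illustration only; these values are NOT from the wine dataset.
toy <- data.frame(
  quality = factor(c(5, 5, 5, 6, 6, 6, 7, 7, 7)),
  alcohol = c(9.2, 9.5, 9.8, 10.1, 10.5, 11.0, 11.2, 11.4, 12.0)
)

# by() splits the first argument by the grouping factor and applies the
# function to each piece, printing one block per group.
medians_by <- by(toy$alcohol, toy$quality, median)
print(medians_by)

# tapply() computes the same medians but returns them as a named array,
# which is easier to reuse in later calculations or plots.
medians <- tapply(toy$alcohol, toy$quality, median)
```

Medians are a reasonable choice for this comparison because several variables in the dataset are right skewed and contain outliers, which would pull group means upward.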
##
## Pearson's product-moment correlation
##
## data: quality_as_int and alcohol
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
As the Pearson correlation test shows, there is a moderate positive correlation (r ≈ 0.44) between the alcohol content of a wine sample and the quality rating it receives.
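The numbers in the test output are linked to each other. As a quick check, the t statistic and the 95 percent confidence interval can be recomputed from just the reported correlation and degrees of freedom, using the standard t formula and the Fisher z-transformation.

```r
# Values reported by the Pearson test above.
r  <- 0.4355747
df <- 4896          # degrees of freedom = n - 2
n  <- df + 2

# t statistic for the null hypothesis that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z     <- atanh(r)
se    <- 1 / sqrt(n - 3)
ci_95 <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci_95, 4))
```

With almost 4900 observations the standard error is small, so even weak correlations later in the report come out as statistically significant. The size of r is therefore a better guide than the p-value to how much a variable matters.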
## [1] "Median of residual.sugar by quality:"
## wine$quality: 3
## [1] 4.6
## ------------------------------------------------------------
## wine$quality: 4
## [1] 2.5
## ------------------------------------------------------------
## wine$quality: 5
## [1] 7
## ------------------------------------------------------------
## wine$quality: 6
## [1] 5.3
## ------------------------------------------------------------
## wine$quality: 7
## [1] 3.65
## ------------------------------------------------------------
## wine$quality: 8
## [1] 4.3
## ------------------------------------------------------------
## wine$quality: 9
## [1] 2.2
Now taking a closer look by limiting the Y axis:
Residual sugar seems to have a low impact on the quality rating of the wines. It is interesting that at the rating level of 9, the residual sugar tends to be lower than at the rating level of 3, even though for the mid ranged rating levels it tends to go up.
##
## Pearson's product-moment correlation
##
## data: quality_as_int and residual.sugar
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12524103 -0.06976101
## sample estimates:
## cor
## -0.09757683
The correlation test agrees: the correlation is negative but weak (r ≈ -0.10).
## [1] "Median of chlorides by quality:"
## wine$quality: 3
## [1] 0.041
## ------------------------------------------------------------
## wine$quality: 4
## [1] 0.046
## ------------------------------------------------------------
## wine$quality: 5
## [1] 0.047
## ------------------------------------------------------------
## wine$quality: 6
## [1] 0.043
## ------------------------------------------------------------
## wine$quality: 7
## [1] 0.037
## ------------------------------------------------------------
## wine$quality: 8
## [1] 0.036
## ------------------------------------------------------------
## wine$quality: 9
## [1] 0.031
Now looking at it a little closer:
There is possibly a slight relationship: lower chloride levels seem to go with a higher quality rating. Interestingly, the median rises up to the rating of 5 and then declines, so the relationship looks positive at first and negative afterwards.
##
## Pearson's product-moment correlation
##
## data: quality_as_int and chlorides
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2365501 -0.1830039
## sample estimates:
## cor
## -0.2099344
The correlation test shows similar insights into the data, with a slight negative correlation found.
## [1] "Median of sulphates by quality:"
## wine$quality: 3
## [1] 0.44
## ------------------------------------------------------------
## wine$quality: 4
## [1] 0.47
## ------------------------------------------------------------
## wine$quality: 5
## [1] 0.47
## ------------------------------------------------------------
## wine$quality: 6
## [1] 0.48
## ------------------------------------------------------------
## wine$quality: 7
## [1] 0.48
## ------------------------------------------------------------
## wine$quality: 8
## [1] 0.46
## ------------------------------------------------------------
## wine$quality: 9
## [1] 0.46
Now taking a more zoomed in look at the trend.
There seems to be very little relationship between sulphates and quality.
##
## Pearson's product-moment correlation
##
## data: quality_as_int and sulphates
## t = 3.7613, df = 4896, p-value = 0.000171
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02571007 0.08156172
## sample estimates:
## cor
## 0.05367788
Very little correlation is found.
## [1] "Median of free.sulfur.dioxide by quality:"
## wine$quality: 3
## [1] 33.5
## ------------------------------------------------------------
## wine$quality: 4
## [1] 18
## ------------------------------------------------------------
## wine$quality: 5
## [1] 35
## ------------------------------------------------------------
## wine$quality: 6
## [1] 34
## ------------------------------------------------------------
## wine$quality: 7
## [1] 33
## ------------------------------------------------------------
## wine$quality: 8
## [1] 35
## ------------------------------------------------------------
## wine$quality: 9
## [1] 28
There seems to be little relation here.
According to the information that was provided with the dataset, when free SO2 is lower than 50 ppm (~ 50 mg/L), it is undetectable to humans. In the following plot there are very few wines that are above this level which suggests that the variations seen in this plot are not related to an effect of the free SO2, but to the unbalanced distribution of wines across the quality ratings.
So only a little correlation would be expected.
##
## Spearman's rank correlation rho
##
## data: quality_as_int and free.sulfur.dioxide
## S = 1.912e+10, p-value = 0.09703
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.02371338
And little correlation is found (rho ≈ 0.02), which is not statistically significant at the 5 percent level (p ≈ 0.097).
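The test above reports Spearman's rho, whereas the other tests in this section report Pearson's r; the report does not state why. Spearman's rho is Pearson's r computed on ranks, which makes it less sensitive to extreme values such as the free sulfur dioxide maximum of 289. The sketch below illustrates this on toy data; the values are made up for illustration and are not from the wine dataset.

```r
# Toy data for illustration only; NOT taken from the wine dataset.
# Ratings are discrete with ties; the measurement has one extreme value.
rating      <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)
measurement <- c(30, 18, 35, 40, 34, 29, 33, 36, 35, 289)

pearson  <- cor(rating, measurement, method = "pearson")
spearman <- cor(rating, measurement, method = "spearman")

# Spearman's rho equals Pearson's r applied to the (average) ranks.
spearman_by_hand <- cor(rank(rating), rank(measurement))

print(c(pearson = pearson, spearman = spearman))
```

Because ranks cap the influence of any single observation, the two coefficients can differ noticeably when a variable has a long tail.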
## [1] "Median of total.sulfur.dioxide by quality:"
## wine$quality: 3
## [1] 159.5
## ------------------------------------------------------------
## wine$quality: 4
## [1] 117
## ------------------------------------------------------------
## wine$quality: 5
## [1] 151
## ------------------------------------------------------------
## wine$quality: 6
## [1] 132
## ------------------------------------------------------------
## wine$quality: 7
## [1] 122
## ------------------------------------------------------------
## wine$quality: 8
## [1] 122
## ------------------------------------------------------------
## wine$quality: 9
## [1] 119
The pattern is similar to that of the free sulfur dioxide concentrations.
##
## Pearson's product-moment correlation
##
## data: quality_as_int and total.sulfur.dioxide
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2017563 -0.1474524
## sample estimates:
## cor
## -0.1747372
Interestingly, the negative correlation with quality is somewhat stronger for total sulfur dioxide (r ≈ -0.17) than for free sulfur dioxide.
## [1] "Median of citric.acid by quality:"
## wine$quality: 3
## [1] 0.345
## ------------------------------------------------------------
## wine$quality: 4
## [1] 0.29
## ------------------------------------------------------------
## wine$quality: 5
## [1] 0.32
## ------------------------------------------------------------
## wine$quality: 6
## [1] 0.32
## ------------------------------------------------------------
## wine$quality: 7
## [1] 0.31
## ------------------------------------------------------------
## wine$quality: 8
## [1] 0.32
## ------------------------------------------------------------
## wine$quality: 9
## [1] 0.36
A zoomed in look:
There does not seem to be a relationship here.
##
## Pearson's product-moment correlation
##
## data: quality_as_int and citric.acid
## t = -0.6444, df = 4896, p-value = 0.5193
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03720595 0.01880221
## sample estimates:
## cor
## -0.009209091
The correlation test confirms the previous observation of the graph.
## [1] "Median of volatile.acidity by quality:"
## wine$quality: 3
## [1] 0.26
## ------------------------------------------------------------
## wine$quality: 4
## [1] 0.32
## ------------------------------------------------------------
## wine$quality: 5
## [1] 0.28
## ------------------------------------------------------------
## wine$quality: 6
## [1] 0.25
## ------------------------------------------------------------
## wine$quality: 7
## [1] 0.25
## ------------------------------------------------------------
## wine$quality: 8
## [1] 0.26
## ------------------------------------------------------------
## wine$quality: 9
## [1] 0.27
And now a zoomed in look:
There seems to be a slight downward trend until the rating of 9, where the reversal could be due to the more limited sample size of quality level 9 wines.
##
## Pearson's product-moment correlation
##
## data: quality_as_int and volatile.acidity
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2215214 -0.1676307
## sample estimates:
## cor
## -0.194723
A weak negative correlation (r ≈ -0.19) is found.
## [1] "Median of fixed.acidity by quality:"
## wine$quality: 3
## [1] 7.3
## ------------------------------------------------------------
## wine$quality: 4
## [1] 6.9
## ------------------------------------------------------------
## wine$quality: 5
## [1] 6.8
## ------------------------------------------------------------
## wine$quality: 6
## [1] 6.8
## ------------------------------------------------------------
## wine$quality: 7
## [1] 6.7
## ------------------------------------------------------------
## wine$quality: 8
## [1] 6.8
## ------------------------------------------------------------
## wine$quality: 9
## [1] 7.1
A closer look:
There is a slight trend of a higher quality rating when there is a lower fixed acidity concentration. However, there are fewer observations at the quality ratings of 3 and 9 than at the middle ratings, which may make their median values less reliable. Additionally, there is a large dispersion of acidity values within each quality rating.
##
## Pearson's product-moment correlation
##
## data: quality_as_int and fixed.acidity
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.14121974 -0.08592991
## sample estimates:
## cor
## -0.1136628
Only a weak negative correlation (r ≈ -0.11) is found.
## [1] "Median of pH by quality:"
## wine$quality: 3
## [1] 3.215
## ------------------------------------------------------------
## wine$quality: 4
## [1] 3.16
## ------------------------------------------------------------
## wine$quality: 5
## [1] 3.16
## ------------------------------------------------------------
## wine$quality: 6
## [1] 3.18
## ------------------------------------------------------------
## wine$quality: 7
## [1] 3.2
## ------------------------------------------------------------
## wine$quality: 8
## [1] 3.23
## ------------------------------------------------------------
## wine$quality: 9
## [1] 3.28
And now a closer look:
There seems to be an upward trend here from the rating of 4 onward. Since a higher pH means a lower acidity, this could mean that less acidic wines tend to receive a higher quality rating. This relationship will be checked later on.
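When reading this trend it helps to remember that pH is a negative base-10 logarithm, so small pH differences correspond to larger relative differences in hydrogen-ion concentration, and a higher pH means fewer hydrogen ions. The sketch below converts two of the median values above, using the usual approximation that concentration in mol/L is 10^(-pH); `h_ion` is an illustrative helper name.

```r
# pH is the negative base-10 logarithm of the hydrogen-ion activity,
# so the approximate concentration in mol/L is 10^(-pH).
h_ion <- function(pH) 10^(-pH)

# Median pH at quality ratings 4 and 9, from the output above.
ph_q4 <- 3.16
ph_q9 <- 3.28

# How many times more hydrogen ions the rating-4 median implies
# compared with the rating-9 median.
acidity_ratio <- h_ion(ph_q4) / h_ion(ph_q9)
print(round(acidity_ratio, 2))
```

The ratio is above 1, meaning that the median rating-4 wine is more acidic, in hydrogen-ion terms, than the median rating-9 wine.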
##
## Pearson's product-moment correlation
##
## data: quality_as_int and pH
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07162022 0.12707983
## sample estimates:
## cor
## 0.09942725
However, the correlation found is weak (r ≈ 0.10).
## [1] "Median of density by quality:"
## wine$quality: 3
## [1] 0.994425
## ------------------------------------------------------------
## wine$quality: 4
## [1] 0.9941
## ------------------------------------------------------------
## wine$quality: 5
## [1] 0.9953
## ------------------------------------------------------------
## wine$quality: 6
## [1] 0.99366
## ------------------------------------------------------------
## wine$quality: 7
## [1] 0.99176
## ------------------------------------------------------------
## wine$quality: 8
## [1] 0.99164
## ------------------------------------------------------------
## wine$quality: 9
## [1] 0.9903
And zooming in shows:
Lower density seems to go with a higher quality rating, although the median at the rating of 5 breaks this trend slightly. According to the information provided with the dataset, density depends on the percentage of alcohol and the sugar content of the wine. This relationship will be checked later on, but there does seem to be a relation.
##
## Pearson's product-moment correlation
##
## data: quality_as_int and density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3322718 -0.2815385
## sample estimates:
## cor
## -0.3071233
A moderate negative correlation (r ≈ -0.31) is found.
And now taking a zoomed look at it:
And its correlation test results:
##
## Pearson's product-moment correlation
##
## data: residual.sugar and alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4726723 -0.4280267
## sample estimates:
## cor
## -0.4506312
A stronger correlation between the alcohol content and the residual sugar was expected, since the alcohol should come from the fermentation of the sugars. However, this is still a moderate negative correlation (r ≈ -0.45).
Possibly some of the wines are fortified with extra alcohol added after the fermentation process, or the yeast behaves in such a way that does not allow the data to establish a linear relationship between sugar fermentation and alcohol production. There is also the fact that the data does not mention which types of grapes were used, which may have different sugar contents that could impact this relationship.
And now taking a zoomed look at it:
And its correlation test results:
##
## Pearson's product-moment correlation
##
## data: sulphates and total.sulfur.dioxide
## t = 9.5019, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1069590 0.1619585
## sample estimates:
## cor
## 0.1345624
The sulphate additive appears to have only a weak correlation (r ≈ 0.13) with the total sulfur dioxide in the wine samples.
And now taking a zoomed look at it:
And its correlation test results:
##
## Pearson's product-moment correlation
##
## data: sulphates and free.sulfur.dioxide
## t = 4.1508, df = 4896, p-value = 3.369e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03126264 0.08707928
## sample estimates:
## cor
## 0.05921725
The sulphate additive appears to have almost no correlation (r ≈ 0.06) with the free sulfur dioxide in the white wine samples.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
The wine quality rating has its strongest relationships with three variables: alcohol content, density, and chlorides. Two others, total sulfur dioxide and volatile acidity, are noteworthy but less impactful. The correlation coefficients between quality and each variable are shown below.
## [,1]
## alcohol 0.44036918
## residual.sugar -0.08206979
## chlorides -0.31448848
## sulphates 0.03331897
## free.sulfur.dioxide 0.02371338
## total.sulfur.dioxide -0.19668029
## citric.acid 0.01833273
## volatile.acidity -0.19656168
## fixed.acidity -0.08448545
## pH 0.10936208
## density -0.34835102
There is a negative correlation for density, which makes sense because density falls as alcohol content rises. Alcohol being a large contributor to the quality rating makes sense as well. The negative relation for chlorides means that higher rated wines tend to have lower chloride concentrations, that is, they are less salty. Total sulfur dioxide and volatile acidity are also of interest.
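The one-column table above, with its `[,1]` header, is the shape `cor()` returns when a set of columns is correlated against a single vector. Its free.sulfur.dioxide entry matches the Spearman rho reported earlier, which suggests, but does not confirm, that the table was computed with `method = "spearman"`; that would also explain why its values differ slightly from the Pearson tests. The sketch below shows the call on toy data; the toy values are made up for illustration and are not from the wine dataset.

```r
# Toy data for illustration only; NOT taken from the wine dataset.
toy <- data.frame(
  alcohol   = c(9.0, 9.5, 10.0, 10.8, 11.5, 12.4),
  chlorides = c(0.060, 0.050, 0.047, 0.041, 0.036, 0.030),
  pH        = c(3.20, 3.10, 3.25, 3.15, 3.30, 3.18),
  quality   = c(4L, 5L, 5L, 6L, 7L, 8L)
)

predictors <- setdiff(names(toy), "quality")

# cor(columns, vector) returns a one-column matrix, printed under "[,1]".
# method = "spearman" is an assumption about how the table was produced.
cor_with_quality <- cor(toy[, predictors], toy$quality, method = "spearman")
print(cor_with_quality)

# Rank the variables by the absolute strength of their association.
ord    <- order(abs(cor_with_quality[, 1]), decreasing = TRUE)
ranked <- cor_with_quality[ord, , drop = FALSE]
```

Sorting by absolute value puts strong negative relationships, such as density and chlorides in the real table, next to strong positive ones like alcohol.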
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
The relationship between the alcohol level and density was as expected.
It was of interest to observe the relationship between chlorides and quality.
It was unexpected to not find a stronger relationship between the residual sugar and alcohol concentration, since the alcohol should in theory come from the fermentation of sugars in the wine making process.
What was the strongest relationship you found?
The correlation coefficients show that the variable with the strongest relationship with quality rating is the alcohol concentration.
Here is a correlation matrix of the data:
Of all the variables, quality correlates most strongly with alcohol content. Density should go down as the alcohol goes up. So as density decreases, alcohol goes up and, in general, so does the quality rating of the wine.
The lowest quality wines have low alcohol and high density. The middle quality wines (rated 5 and 6) are spread throughout the plot area, but more wines of quality level 3 can be seen on the left side of the graph, and more blue points (ratings of 8 or 9) towards the right side of the graph.
The trend does seem to be that as the chloride concentration goes down and the alcohol concentration goes up, the quality increases.
The total sulfur dioxide does not seem to have much effect on the quality of the wine.
The volatile acidity of the wine does not seem to have much of an impact, although generally the lower the volatile acidity, the better the quality rating the wine receives.
It looks like the lower the chlorides and the lower the volatile acidity, the better the quality rating the wine is given.
Lower chlorides and total sulfur dioxide levels between 100 and 200 mg/dm^3 seem to be where the higher rated wine samples fall.
A lower density and lower chlorides tend to go with a higher quality rating.
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
The main relationships explored were those among the variables with the largest correlations with quality.
It has been shown how alcohol, chlorides, density, total sulfur dioxide, and volatile acidity relate to quality. Higher alcohol, lower density, and low chloride concentration will typically give a better wine rating for quality.
There also tends to be a range of total sulfur dioxide, between 100 and 200 mg/dm^3, that goes with a better quality rating.
As chlorides go down, there is an increase in the quality rating of the wines. It is not a very strong relationship, but it is noteworthy nonetheless.
Here is the distribution of alcohol across the different quality ratings. The boxplot shows the quartile boundaries and median values, while the overlaid dots show the actual distribution of the wine samples. Median alcohol content oddly declines from the rating of 3 to the rating of 5, but it rises markedly from there. The samples are unbalanced across ratings: there are many more middle rated wines than low or high rated ones. The line connects the median values and helps to show visually the increasing trend of alcohol with quality rating.
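The code behind this plot is not shown in the report. The sketch below is one possible way to build a similar figure in base R: a boxplot per rating, jittered points on top, and a line through the medians. `plot_by_quality` is a hypothetical helper, while the median values are the ones printed earlier in the report.

```r
# Boxplot of `value` per `group`, with jittered points overlaid and the group
# medians connected by a line. Returns the medians invisibly.
plot_by_quality <- function(value, group, ylab = "value") {
  group   <- factor(group)
  medians <- tapply(value, group, median)
  boxplot(value ~ group, xlab = "quality", ylab = ylab, outline = FALSE)
  points(jitter(as.integer(group), amount = 0.2), value,
         pch = 16, cex = 0.5, col = grDevices::adjustcolor("steelblue", 0.4))
  lines(seq_along(medians), medians, type = "b", col = "red", lwd = 2)
  invisible(medians)
}

# Median alcohol per rating, copied from the by() output earlier in the report.
median_alcohol <- c("3" = 10.45, "4" = 10.1, "5" = 9.5, "6" = 10.5,
                    "7" = 11.4, "8" = 12, "9" = 12.5)

# Rating-to-rating change in the median: negative up to 5, positive afterwards.
median_steps <- diff(median_alcohol)
print(median_steps)

# With the full data frame loaded as `wine`:
# plot_by_quality(wine$alcohol, wine$quality, ylab = "alcohol (% by volume)")
```

Semi-transparent points are used so that the dense middle ratings do not turn into a solid block, which keeps the imbalance between ratings visible.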
The analysis suggests that alcohol and chlorides play a notable role in the quality rating a wine receives. This plot shows a trend: the lower the chloride concentration and the higher the alcohol concentration of a wine, the higher its quality rating is likely to be. The inverse trend is also visible: wine samples with a high chloride concentration and a low alcohol concentration tend to receive a lower rating.
Working with datasets has the challenge of deciding how to approach the exploration of the data. Because this dataset also came with a description file, it already outlined some possible variables that might lend themselves to exploration. This proved quite useful. For example, when the description file said that citric acid could add a freshness element to the taste of wines, while acetic acid would add an almost vinegar like taste, this provided context to many of the observations that would follow. Another example would be the density of wine being close to water, but lower as its alcohol content increased. This then explained the inverse relationship seen between the variables for density and alcohol in the graphs. This also shows just how important it is to have some knowledge of the subject matter when beginning an analysis, as some contexts could be lost on a casual observer. This knowledge helps structure the approach to the analysis, and allows better theory formation so that meaningful insights can be gotten from the data processing.
Another challenge was figuring out how to communicate with the multivariate plots. When a third dimension is added to a plot, it becomes harder to visualize mentally. Using color helped with this, making it easier to grasp what information each step of the analysis was communicating and adding clarity and depth. The correlation matrix was a welcome addition, and it also helped to narrow down which variables to focus on for further exploration. Overall, data needs to be communicated in such a way that the reader can easily understand and follow its story, so putting extra effort into making it legible proved a good use of time. The dataset already being clean and tidy also made working with it significantly easier.
As a way to expand the analysis, bringing in other different types of wines would be an interesting way to see if the trends are strengthened, or weakened, with this new data. Expanding the dataset with the same type of wine, but simply having more observations would also be interesting, as the dataset used in this analysis was not very large. It would also be of interest if some additional variables could be added to the total dataset, such as type of grape, location grown, and how long before the grape was harvested.
In summary, with the main relationships in the dataset found and explored, a logical next step would be to use these trends to predict the quality ratings of other wines. Data gathered from checking those predictions could then be used to refine the prediction process further.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
StackOverflow website for various questions and research. Available at: http://www.Stackoverflow.com
R Bloggers website for various guides to using R. Available at: http://www.r-bloggers.com
R Markdown website by RStudio. Available at: https://rmarkdown.rstudio.com/authoring_pandoc_markdown.html